[Intel MKL] Use Shard function instead of Eigen device to parallelize Adam kernel. #26424
Conversation
This could reduce the memory access and get good cache locality for CPU.
modified:
- tensorflow/core/kernels/training_ops.cc
- tensorflow/core/kernels/training_ops.h
- tensorflow/core/kernels/training_ops_gpu.cu.cc
Signed-off-by: Lu Teng teng.lu@intel.com
An Eigen device expression can only update one variable at a time, but Adam needs to update three variables and therefore uses three expressions, which hurts CPU cache locality. This patch uses the Shard function in place of the Eigen device expressions. The patch was tested on the NCF model from the MLPerf 0.5 submission.
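For illustration only, here is a minimal sketch of the idea, not the exact code in this PR: the three Adam state updates are fused into one loop over the elements and parallelized with Shard from work_sharder.h. The helper name AdamShardedUpdate and the cost_per_unit value are made up for this sketch.

```c++
#include <cmath>

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/util/work_sharder.h"

namespace tensorflow {

// Hypothetical helper (not part of the PR); parameter names mirror the
// ApplyAdam op inputs.
void AdamShardedUpdate(OpKernelContext* ctx, float* var, float* m, float* v,
                       const float* grad, int64 length, float beta1_power,
                       float beta2_power, float lr, float beta1, float beta2,
                       float epsilon) {
  const float alpha =
      lr * std::sqrt(1.0f - beta2_power) / (1.0f - beta1_power);
  auto work = [=](int64 begin, int64 end) {
    for (int64 i = begin; i < end; ++i) {
      // All three state updates for element i happen back to back, instead of
      // three separate Eigen expressions each sweeping the whole arrays.
      m[i] += (grad[i] - m[i]) * (1.0f - beta1);
      v[i] += (grad[i] * grad[i] - v[i]) * (1.0f - beta2);
      var[i] -= (m[i] * alpha) / (std::sqrt(v[i]) + epsilon);
    }
  };
  const DeviceBase::CpuWorkerThreads& workers =
      *ctx->device()->tensorflow_cpu_worker_threads();
  const int64 cost_per_unit = 20;  // rough guess at the per-element cost
  Shard(workers.num_threads, workers.workers, length, cost_per_unit, work);
}

}  // namespace tensorflow
```

Because one pass touches var[i], m[i], v[i], and grad[i] together, each cache line is reused before eviction, which is the locality argument made above.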
Pinging @ezhulenev for a review. Thanks.
Could you please also add a benchmark similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc? I'll need it to run performance testing internally.
To get better cache locality, use Shard instead of Eigen expression.
Also added a benchmark to test Adam performance.
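For readers who want a feel for what such a benchmark looks like, here is a rough sketch in the spirit of the existing kernel micro-benchmarks (bias_op_test.cc and the training-op benchmarks); it is not the code added in this PR, and the constants, Arg values, and helpers (Var, Zeros, Random, Scalar, Adam, BM_Adam) are placeholders for illustration.

```c++
#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/graph/graph.h"
#include "tensorflow/core/graph/testlib.h"
#include "tensorflow/core/platform/test.h"
#include "tensorflow/core/platform/test_benchmark.h"

namespace tensorflow {

static Node* Var(Graph* g, int n) {
  return test::graph::Var(g, DT_FLOAT, TensorShape({n}));
}

static Node* Zeros(Graph* g, int n) {
  Tensor data(DT_FLOAT, TensorShape({n}));
  data.flat<float>().setZero();
  return test::graph::Constant(g, data);
}

static Node* Random(Graph* g, int n) {
  Tensor data(DT_FLOAT, TensorShape({n}));
  data.flat<float>().setRandom();
  return test::graph::Constant(g, data);
}

static Node* Scalar(Graph* g, float val) {
  Tensor data(DT_FLOAT, TensorShape({}));
  data.flat<float>()(0) = val;
  return test::graph::Constant(g, data);
}

// Builds an init graph (zero-fills var/m/v) and a train graph that runs one
// ApplyAdam step over n parameters. The variables are created in the same
// order in both graphs so they refer to the same state on the device.
static void Adam(int n, Graph** init_g, Graph** train_g) {
  {
    Graph* g = new Graph(OpRegistry::Global());
    Node* var = Var(g, n);
    Node* m = Var(g, n);
    Node* v = Var(g, n);
    Node* zero = Zeros(g, n);
    test::graph::Assign(g, var, zero);
    test::graph::Assign(g, m, zero);
    test::graph::Assign(g, v, zero);
    *init_g = g;
  }
  {
    Graph* g = new Graph(OpRegistry::Global());
    Node* var = Var(g, n);
    Node* m = Var(g, n);
    Node* v = Var(g, n);
    test::graph::Multi(g, "ApplyAdam",
                       {var, m, v, Scalar(g, 0.9f), Scalar(g, 0.99f),
                        Scalar(g, 0.01f), Scalar(g, 0.9f), Scalar(g, 0.99f),
                        Scalar(g, 1e-8f), Random(g, n)});
    *train_g = g;
  }
}

static void BM_Adam(int iters, int params) {
  const int64 tot = static_cast<int64>(iters) * params;
  testing::ItemsProcessed(tot);
  testing::BytesProcessed(tot * sizeof(float));
  Graph* init;
  Graph* train;
  Adam(params, &init, &train);
  test::Benchmark("cpu", train, /*options=*/nullptr, init).Run(iters);
}
BENCHMARK(BM_Adam)->Arg(8192)->Arg(1 << 20);

}  // namespace tensorflow
```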
Hi @ezhulenev, I've refined the code and added a benchmark: https://github.com/tensorflow/tensorflow/pull/26424/files#diff-0b9bd0c5daec98f25d2e15c9b8c0370cR200. My test result is: original:
BTW, the Tensor vectorization form has performance similar to the loop version. I guess Eigen may generate the same assembly internally, but I didn't dig too deep.
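For context, here is a sketch of the two per-shard forms being compared; this is illustrative and not the PR code (the function names are made up). Both update the first moment m over the shard range [begin, end).

```c++
#include "unsupported/Eigen/CXX11/Tensor"

using Flat = Eigen::TensorMap<Eigen::Tensor<float, 1>>;
using ConstFlat = Eigen::TensorMap<Eigen::Tensor<const float, 1>>;

// "Tensor vectorization" form: an Eigen expression over the shard slice.
void UpdateMTensor(float* m, const float* grad, Eigen::Index begin,
                   Eigen::Index end, float beta1) {
  Flat m_slice(m + begin, end - begin);
  ConstFlat g_slice(grad + begin, end - begin);
  m_slice += (g_slice - m_slice) * (1.0f - beta1);
}

// Plain loop form: the compiler can usually auto-vectorize this.
void UpdateMLoop(float* m, const float* grad, Eigen::Index begin,
                 Eigen::Index end, float beta1) {
  for (Eigen::Index i = begin; i < end; ++i) {
    m[i] += (grad[i] - m[i]) * (1.0f - beta1);
  }
}
```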
  length = length / size;
} else {
  size = 1;
}
There is no need to divide the input size by the packet size and do "manual vectorization". If it's desirable for the shard size (end - begin) to be a multiple of the packet size, you can pass a block_align function to parallelFor (see https://bitbucket.org/eigen/eigen/src/4b28c8008901c6d760f48f26ee2e3423fd8a2b40/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h#lines-185). I think this should work:
[](Index index) -> Index { return Eigen::divup(index, packet_size); }
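Purely as a sketch of how that suggestion might look, assuming Eigen's ThreadPoolDevice::parallelFor with a block_align callback (the helper name AdamParallelFor and the cost numbers are placeholders, not the PR's code):

```c++
#define EIGEN_USE_THREADS
#include "unsupported/Eigen/CXX11/Tensor"

// Hypothetical helper showing parallelFor with block_align; not the PR code.
template <typename T>
void AdamParallelFor(const Eigen::ThreadPoolDevice& d, T* var, T* m, T* v,
                     const T* grad, Eigen::Index length, T alpha, T beta1,
                     T beta2, T epsilon) {
  const Eigen::Index packet_size = Eigen::internal::packet_traits<T>::size;
  // Per-element cost estimate (bytes loaded/stored, compute cycles); the
  // numbers here are placeholders, not tuned values.
  const Eigen::TensorOpCost cost(/*bytes_loaded=*/4 * sizeof(T),
                                 /*bytes_stored=*/3 * sizeof(T),
                                 /*compute_cycles=*/10);
  // Round each block size up to a multiple of the packet size so shards stay
  // SIMD-friendly without manual packet handling in the kernel body.
  auto block_align = [packet_size](Eigen::Index block) -> Eigen::Index {
    return Eigen::divup(block, packet_size) * packet_size;
  };
  d.parallelFor(length, cost, block_align,
                [=](Eigen::Index begin, Eigen::Index end) {
                  for (Eigen::Index i = begin; i < end; ++i) {
                    m[i] += (grad[i] - m[i]) * (T(1) - beta1);
                    v[i] += (grad[i] * grad[i] - v[i]) * (T(1) - beta2);
                    var[i] -=
                        (m[i] * alpha) / (Eigen::numext::sqrt(v[i]) + epsilon);
                  }
                });
}
```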
I ran into some questions when trying to use this function; please see my comment below.
I guess that after inlining it all might have been fused into a single loop by the compiler. Anyway, it's great that there is no performance difference and we can keep the simpler code.
When I try to use this function:
env: Intel Xeon Skylake-8180, 56 cores
current implementation: BM_Adam/8192/1 82133 8295 399.0MB/s 99.7M items/s
with parallelFor:
That's strange. I'll try to reproduce it internally after it is merged.
I think the problem is an incorrectly computed cost, which makes Eigen shard too much or too little.
Hi @ezhulenev, please take a look at the new commit. I fixed an error in computing the cost (it needed an extra multiplication). The 'manual vectorization' is still better than the block_align version.
…adam PiperOrigin-RevId: 239499372